The purpose of this project is to analyze the determinants of car prices. During the pandemic, supply shortages sent the used car market surging; this article gives useful background: https://spectrumlocalnews.com/nys/central-ny/news/2024/05/02/used-vehicles-during-pandemic. That led me to one question: “What are the main determinants of the price of a car?” This project answers it with machine learning models and with data visualizations that illustrate the findings. A section at the end of the analysis uses the model developed here to predict car prices.
Data Description
The data comes from Kaggle. This dataset encompasses details such as the year, make, model, trim, body type, transmission type, VIN (Vehicle Identification Number), state of registration, condition rating, odometer reading, exterior and interior colors, seller information, Manheim Market Report (MMR) values, selling prices, and sale dates.
Results
Many variables influence car price, ranging from the odometer reading and the body type to the trim level. Market conditions also play a large role, as we will see below: a change in the Manheim Market Report (MMR) value can cause a large change in the price of the vehicle. Supply-side shortages or a widely successful campaign could therefore influence car prices. Further research will require larger datasets and more variables.
Further Analysis
To build on this analysis, we can look at how the time of year affects consumer demand for cars. If cars sell for higher prices at certain points of the year, buyers can use this to their advantage and time their purchase to get a better deal.
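One way to start on the seasonality question is to pull the month out of `saledate` and compare typical selling prices by month. The sketch below runs on a small made-up data frame, not the Kaggle data. It assumes `saledate` strings look like `"Tue Dec 16 2014 12:30:00 GMT-0800 (PST)"`, and the helper name `median_price_by_month` is hypothetical.

```r
# Toy data standing in for the real `cars` table (all values are made up).
toy_sales <- data.frame(
  saledate = c(
    "Tue Dec 16 2014 12:30:00 GMT-0800 (PST)",
    "Wed Dec 17 2014 09:00:00 GMT-0800 (PST)",
    "Thu Jan 15 2015 04:30:00 GMT-0800 (PST)",
    "Tue Feb 03 2015 01:30:00 GMT-0800 (PST)",
    "Wed Feb 04 2015 02:00:00 GMT-0800 (PST)"
  ),
  sellingprice = c(21500, 20500, 30000, 9800, 10200)
)

# Hypothetical helper: median selling price per calendar month.
median_price_by_month <- function(df) {
  # Assumption: the month abbreviation is the second space-separated token.
  month_abbr <- sapply(strsplit(as.character(df$saledate), " "), `[`, 2)
  df$month <- factor(month_abbr, levels = month.abb)
  out <- aggregate(sellingprice ~ month, data = df, FUN = median)
  out[order(out$month), ]
}

monthly <- median_price_by_month(toy_sales)
print(monthly)
```

The median is used rather than the mean because the price distribution is heavily right-skewed, as the summary statistics later in the report show.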
Interior color counts, where (empty) marks the blank level:
   (empty)         —     beige     black      blue     brown  burgundy      gold
       749     17077     59758    244329      1143      8640       191       324
      gray     green off-white    orange    purple       red    silver       tan
    178581       245       480       145       339      1363      1104     44093
     white    yellow
       256        20
Checking for missing values
gg_miss_var(cars)
library(skimr)
Warning: package 'skimr' was built under R version 4.4.3
Attaching package: 'skimr'
The following object is masked from 'package:naniar':
n_complete
The following object is masked from 'package:mdsr':
skim
skim(cars)
Warning: There was 1 warning in `dplyr::summarize()`.
ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
mangled_skimmers$funs)`.
ℹ In group 0: .
Caused by warning:
! There were 9 warnings in `dplyr::summarize()`.
The first warning was:
ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
mangled_skimmers$funs)`.
Caused by warning in `sorted_count()`:
! Variable contains value(s) of "" that have been converted to "empty".
ℹ Run `dplyr::last_dplyr_warnings()` to see the 8 remaining warnings.
Data summary
Name: cars
Number of rows: 558837
Number of columns: 16
Column type frequency: factor 11, numeric 5
Group variables: None

Variable type: factor
skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
make                   0              1  FALSE          97  For: 93554, Che: 60197, Nis: 53946, Toy: 39871
model                  0              1  FALSE         974  Alt: 19349, F-1: 14479, Fus: 12946, Cam: 12545
trim                   0              1  FALSE        1964  Bas: 55817, SE: 43648, LX: 20757, Lim: 18367
body                   0              1  FALSE          88  Sed: 199437, SUV: 119292, sed: 41906, suv: 24552
transmission           0              1  FALSE           5  aut: 475915, emp: 65352, man: 17544, sed: 15
vin                    0              1  FALSE      550298  aut: 22, wba: 5, emp: 4, 1ft: 4
state                  0              1  FALSE          64  fl: 82945, ca: 73148, pa: 53907, tx: 45913
color                  0              1  FALSE          47  bla: 110970, whi: 106673, sil: 83389, gra: 82857
interior               0              1  FALSE          18  bla: 244329, gra: 178581, bei: 59758, tan: 44093
seller                 0              1  FALSE       14263  nis: 19693, for: 19162, the: 18299, san: 15285
saledate               0              1  FALSE        3767  Tue: 5334, Tue: 5016, Tue: 4902, Tue: 4731

Variable type: numeric
skim_variable  n_missing  complete_rate      mean        sd    p0    p25    p50    p75    p100  hist
year                   0           1.00   2010.04      3.97  1982   2007   2012   2013    2015  ▁▁▁▃▇
condition          11820           0.98     30.67     13.40     1     23     35     42      49  ▃▂▆▇▇
odometer              94           1.00  68320.02  53398.54     1  28371  52254  99109  999999  ▇▁▁▁▁
mmr                   38           1.00  13769.38   9679.97    25   7100  12250  18300  182000  ▇▁▁▁▁
sellingprice          12           1.00  13611.36   9749.50     1   6900  12100  18200  230000  ▇▁▁▁▁
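The `top_counts` for `body` list `Sed` and `sed` (and `SUV` and `suv`) as separate levels, which suggests the same category appears under different capitalizations. Below is a minimal sketch of one way to merge such levels. The values are made up, and lowercasing is an assumed normalization rather than a step taken in this analysis.

```r
# Made-up factor with the same kind of case duplication as `body`.
toy_body <- factor(c("Sedan", "sedan", "SUV", "suv", "Sedan", "Coupe"))
levels_before <- nlevels(toy_body)

# Lowercase the labels, then rebuild the factor so duplicate levels merge.
clean_body <- factor(tolower(as.character(toy_body)))
levels_after <- nlevels(clean_body)

print(table(clean_body))
```

Merging levels this way before modeling keeps a category such as sedans from being split across two dummy variables.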
Dropping all the missing values
cars <- cars |> drop_na()
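Note that `drop_na()` only removes rows containing `NA`. The skim warning above says empty strings were counted as "empty", and `transmission` shows 65352 such values alongside an `n_missing` of 0, so those rows would survive the drop. Here is a sketch, on made-up data and in base R, of converting empty strings to `NA` first. The helper name `blank_to_na` is hypothetical.

```r
# Made-up data: one blank transmission and one genuinely missing odometer.
toy <- data.frame(
  transmission = factor(c("automatic", "", "manual", "automatic")),
  odometer = c(16000, 52000, NA, 99000)
)

# Hypothetical helper: turn "" into NA in every factor or character column.
# Character columns come back as factors.
blank_to_na <- function(df) {
  for (col in names(df)) {
    if (is.factor(df[[col]]) || is.character(df[[col]])) {
      values <- as.character(df[[col]])
      values[!is.na(values) & values == ""] <- NA
      df[[col]] <- factor(values)
    }
  }
  df
}

toy_clean <- blank_to_na(toy)
toy_complete <- toy_clean[complete.cases(toy_clean), ]
print(toy_complete)
```

Whether blank values should be dropped or kept as their own category is a modeling choice; the sketch only shows how to make them visible to `NA`-based tools.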
Exploring relationships among features: correlation matrix
plot(cars$pred, cars$sellingprice)
abline(a = 0, b = 1, col = "red", lwd = 3, lty = 2)
Plotting the predicted selling prices against the actual selling prices. Points on the dashed red line are cars whose predicted price equals the actual price.
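To put a number on how closely the points follow the red 45-degree line, error metrics such as RMSE and R-squared can be computed from the same two columns. The sketch below uses simulated data and a plain `lm()` fit as stand-ins. The source does not show how `car_model` or `cars$pred` were built, so the data, the formula, and every name here are illustrative assumptions.

```r
set.seed(42)

# Simulated stand-in for the cars data (not the real dataset).
n <- 200
toy <- data.frame(
  mmr = runif(n, 5000, 30000),
  odometer = runif(n, 1000, 150000)
)
toy$sellingprice <- 0.98 * toy$mmr - 0.001 * toy$odometer + rnorm(n, sd = 500)

# Assumed model form: a linear regression on two predictors.
toy_model <- lm(sellingprice ~ mmr + odometer, data = toy)
toy$pred <- predict(toy_model, newdata = toy)

# Root mean squared error, in the same units as the price.
rmse <- sqrt(mean((toy$sellingprice - toy$pred)^2))

# Share of price variance explained by the predictions.
r_squared <- 1 - sum((toy$sellingprice - toy$pred)^2) /
  sum((toy$sellingprice - mean(toy$sellingprice))^2)

print(c(rmse = rmse, r_squared = r_squared))
```

These are in-sample metrics; evaluating on held-out rows would give a more honest picture of predictive accuracy.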
An analysis of SUV prices
suv_brands <- cars |>
  filter(make %in% c("Jeep", "Ford", "Land Rover", "Kia", "Cadillac")) |>
  filter(model %in% c("Wrangler", "Range Rover", "Explorer", "Sorento", "Escalade"))
head(suv_brands) %>%
  gt()
Predicting the car price for a Kia Sorento while changing a few of the features
predict(car_model, data.frame(
  year = 2015, make = "Kia", model = "Sorento", trim = "LX", body = "SUV",
  state = "ca", condition = 40, odometer = 16000, color = "white",
  interior = "black", mmr = 20500, transmission = "automatic"
))
1
20427.94
Changing the year
predict(car_model, data.frame(
  year = 2016, make = "Kia", model = "Sorento", trim = "LX", body = "SUV",
  state = "ca", condition = 40, odometer = 16000, color = "white",
  interior = "black", mmr = 20500, transmission = "automatic"
))
1
20381.92
Changing the state
predict(car_model, data.frame(
  year = 2015, make = "Kia", model = "Sorento", trim = "LX", body = "SUV",
  state = "tx", condition = 40, odometer = 16000, color = "white",
  interior = "black", mmr = 20500, transmission = "automatic"
))
1
20427.94
Changing the odometer
predict(car_model, data.frame(
  year = 2015, make = "Kia", model = "Sorento", trim = "LX", body = "SUV",
  state = "ca", condition = 40, odometer = 16500, color = "white",
  interior = "black", mmr = 20500, transmission = "automatic"
))
1
20427.35
Changing the Manheim Market Report value
predict(car_model, data.frame(
  year = 2015, make = "Kia", model = "Sorento", trim = "LX", body = "SUV",
  state = "ca", condition = 40, odometer = 16500, color = "white",
  interior = "black", mmr = 22500, transmission = "automatic"
))
1
22395.21
Changing the interior
predict(car_model, data.frame(
  year = 2015, make = "Kia", model = "Sorento", trim = "LX", body = "SUV",
  state = "ca", condition = 40, odometer = 16500, color = "white",
  interior = "white", mmr = 22500, transmission = "automatic"
))
1
22395.21
Changing the color
predict(car_model, data.frame(
  year = 2015, make = "Kia", model = "Sorento", trim = "LX", body = "SUV",
  state = "ca", condition = 40, odometer = 16500, color = "blue",
  interior = "black", mmr = 22500, transmission = "automatic"
))
1
22395.21
Changing the transmission
predict(car_model, data.frame(
  year = 2015, make = "Kia", model = "Sorento", trim = "LX", body = "SUV",
  state = "ca", condition = 40, odometer = 16500, color = "white",
  interior = "black", mmr = 22500, transmission = "manual"
))
1
22395.21
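The predictions above can be turned into rough per-unit effects by differencing. The arithmetic below uses only the numbers printed above. Reading the differences as constant slopes assumes the model is additive and linear in these features, which the source does not state. The MMR prediction also uses the 16,500-mile odometer reading, so it is compared against that prediction rather than against the baseline.

```r
# Predictions copied from the outputs above.
baseline      <- 20427.94  # year 2015, odometer 16000, mmr 20500
year_2016     <- 20381.92  # year raised by 1
odometer_plus <- 20427.35  # odometer raised by 500 miles
mmr_plus      <- 22395.21  # mmr raised by 2000, odometer still 16500

# Change in predicted price per unit change in each feature.
effects <- c(
  per_year          = year_2016 - baseline,
  per_odometer_mile = (odometer_plus - baseline) / 500,
  per_mmr_dollar    = (mmr_plus - odometer_plus) / 2000
)
print(round(effects, 5))
```

On these numbers, a one-dollar rise in MMR moves the predicted price by about 98 cents, while changing the state, interior, color, or transmission leaves the prediction unchanged.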
Some plots to visualize the data
cars |>
  ggplot(aes(x = condition, y = sellingprice)) +
  geom_point()
This plot shows the relationship between the condition of the car and its selling price. In general, the better the condition, the higher the price the car sells for. Some cars in very poor condition can still fetch a high price in the market.
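With this many overlapping points, one way to check the visual impression is to bin `condition` and compare the median price per bin. The sketch below runs on simulated data; the bin edges and the toy price formula are assumptions, not taken from the analysis.

```r
set.seed(1)

# Simulated stand-in: condition scores 1 to 49, price rising with condition.
toy <- data.frame(condition = rep(1:49, times = 20))
toy$sellingprice <- 5000 + 300 * toy$condition + rnorm(nrow(toy), sd = 1000)

# Group condition into five bins and take the median price in each.
toy$condition_bin <- cut(toy$condition, breaks = c(0, 10, 20, 30, 40, 50))
median_by_bin <- tapply(toy$sellingprice, toy$condition_bin, median)
print(round(median_by_bin))
```

If the medians rise from bin to bin on the real data, that supports the upward trend seen in the scatter plot.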
Same as above, just splitting by brand
cars |>
  filter(make %in% c("Jeep", "Ford", "Land Rover", "Cadillac")) |>
  ggplot(aes(x = condition, y = sellingprice, color = make)) +
  geom_point()
Looking deeper, we narrow the dataset to a few brands that sell SUVs. This shows the differences between brands and how their prices compare at different conditions. For example, Land Rovers sell for high prices regardless of condition, which suggests strong demand.
Same with a facet_wrap
cars_face <- cars |>
  filter(make %in% c("Jeep", "Ford", "Land Rover", "Cadillac", "Kia")) |>
  ggplot(aes(x = condition, y = sellingprice, color = make)) +
  geom_point() +
  facet_wrap(~make)
cars_face
Using a facet wrap to really isolate the difference in prices based on the conditions among the brands.
Doing the same with the selling prices of the individual car models
suv_brands |>
  ggplot(aes(x = condition, y = sellingprice, color = make)) +
  geom_point() +
  facet_wrap(~model)
Looking at the different SUV brands and how the prices compare at different conditions.
A boxplot that shows the distribution of prices. We can also see outliers among the car prices.
library(plotly)
library(tidyverse)

p1 <- cars |>
  filter(make %in% c("jeep", "ford", "land rover", "cadillac", "kia")) |>
  ggplot(aes(x = condition, y = sellingprice)) +
  geom_point(aes(color = make, size = odometer, frame = year, ids = model), alpha = 0.5) +
  labs(
    x = "The Condition of the car at the time of Sale",
    y = "The Selling Price of the Car",
    color = "Brand of the car",
    size = NULL
  )